1. INTRODUCTION

Automatic speech recognition over the Internet faces packet loss and delay [1]. Packet loss in turn
leads to errors in decoding the speech signal into text [2]. TCP ensures that all speech data is received by the
server, but it introduces delay caused by two main mechanisms, namely additive increase multiplicative
decrease (AIMD) and timeout [3]. This delay prevents TCP from being a good solution for speech recognition
over the Internet. To overcome the delay of TCP flows, many studies have sought solutions for real-time
services, such as multimedia streaming and voice over IP, by modeling TCP in circumstances where it may
deliver satisfactory performance [4-6]. Meanwhile, applications such as one developed by Google have
emerged that minimize TCP delay by compressing the source data and optimizing the communication protocol
between the client and the server to make high-performance websites [7].

TCP-based speech recognition with a speech segmenter on the client side requires an acceptable delay
to satisfy users. Our previous work developed a model to investigate this acceptable delay for Bahasa
Indonesia and to find a feasible working region of TCP flow on the basis of the loss rate and the average
round-trip time [8]. In other work [9], the acceptable delay was investigated through calculations with the
packet delay distribution model [10]. The results indicated that the model in [8] was more appropriate to use.

In recent years, real-time applications over TCP have been addressed by sending updates immediately
from the server to the client using asynchronous communication techniques, such as polling, long polling,
and streaming [11]. Polling is a technique in which a browser sends HTTP requests at regular time intervals
in order to get updates from the server promptly. A suitable example of this technique is the remote
measurement of water level and temperature [12]. A useful variant of polling is long polling, which emulates
a push mechanism from the server: the server holds the request open until an update becomes available or a
timeout occurs [13]. The next technique is streaming, in which a browser sends a complete request to the
server and the server keeps it open indefinitely. Neither the client nor the server needs to close the
connection [13].
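The hold-until-update-or-timeout behaviour of long polling can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the class name `UpdateChannel`, the method names, and the timeout values are assumptions chosen for illustration, and the network layer is replaced by in-process coroutines.

```python
import asyncio
from typing import List, Optional


class UpdateChannel:
    """Hypothetical server-side holder of one pending update for one waiting client."""

    def __init__(self) -> None:
        self._event = asyncio.Event()
        self._value: Optional[str] = None

    def publish(self, value: str) -> None:
        # Called when new data appears on the server.
        self._value = value
        self._event.set()

    async def long_poll(self, timeout: float) -> Optional[str]:
        """Hold the request open until an update arrives or the timeout expires."""
        try:
            await asyncio.wait_for(self._event.wait(), timeout)
        except asyncio.TimeoutError:
            return None  # the client would simply issue a new request
        self._event.clear()
        return self._value


async def demo() -> List[Optional[str]]:
    channel = UpdateChannel()
    # First request: nothing is published, so the server answers after the timeout.
    first = await channel.long_poll(timeout=0.05)
    # Second request: an update arrives while the request is being held open.
    asyncio.get_running_loop().call_later(0.01, channel.publish, "hypothesis-1")
    second = await channel.long_poll(timeout=1.0)
    return [first, second]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With plain polling the client would instead repeat short requests at a fixed interval, so an update could wait up to one full interval before being delivered.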

A more advanced technology is full-duplex communication using WebSocket. Google developed
WebSocket to enable bi-directional communication between the client and the server on a single
TCP connection [14]. Alumae used WebSocket to develop a full-duplex speech-to-text system for
Estonian [15]. In this system, the client sends the speech data to the server in any of the containers and
encodings supported by the GStreamer framework (e.g., Ogg, MP4, and Speex). Meanwhile, the server sends
the intermediary results, which are called hypotheses, to the client. The system uses a Kaldi online decoder
to segment the speech data into speech sentences. Each speech sentence is then recognized progressively by a
speech recognizer to generate hypotheses and a final hypothesis [16].

A significant problem with WebSocket arises when it has to pass through a proxy server [17]. A proxy
server usually does not allow an idle connection to remain open for a long time. Therefore, this paper
proposes an alternative, a real-time application framework using Hypertext Transfer Protocol version 2.0
(HTTP/2) plus Server-Sent Events (SSE). HTTP/2 is a hypertext transfer protocol that overcomes weaknesses
of HTTP/1.1, especially by reducing the application latency [18, 19]. In cooperation with SSE [20], it enables
the server to send updates to the client. The framework using HTTP/2 plus SSE is developed on the basis of
Alumae's application [15], in which the WebSocket communication between the client and the server is
replaced by a combination of HTTP/2 and SSE. Furthermore, experiments are conducted to compare the latency
of the application developed using the current framework, i.e., WebSocket, and the proposed framework
(HTTP/2 plus SSE).


2. METHODOLOGY
2.1. Hypertext Transfer Protocol version 2.0 (HTTP/2) 

HTTP/2 is the new version of the Hypertext Transfer Protocol, which was published in May 2015 by the
Internet Engineering Task Force (IETF) as RFC 7540 [18]. The main goal of HTTP/2 is to make applications
faster, simpler, and more robust by providing the following features: full request and response
multiplexing, HTTP header field compression, request prioritization, and server push. HTTP/2 does not
modify the semantics of HTTP applications because it retains all the main concepts of HTTP, e.g.,
HTTP methods, header fields, status codes, and URIs. The most important enhancement of HTTP/2 is the new
binary framing layer within the application layer, as shown in Figure 1. With this layer, HTTP messages are
encapsulated into frames and transferred over a TCP channel.
Figure 1. HTTP/2 binary framing layer [22]
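As an illustration of the binary framing layer, the sketch below packs and unpacks the fixed 9-octet frame header defined in RFC 7540 (24-bit payload length, 8-bit type, 8-bit flags, one reserved bit, and a 31-bit stream identifier). The function names are assumptions made for this example, and only the DATA frame type and the END_STREAM flag are named; a real implementation handles many more frame types and settings.

```python
import struct
from typing import Tuple

FRAME_HEADER_LEN = 9
TYPE_DATA = 0x0          # DATA frame type (RFC 7540, Section 6.1)
FLAG_END_STREAM = 0x1    # END_STREAM flag on DATA frames


def pack_frame(frame_type: int, flags: int, stream_id: int, payload: bytes) -> bytes:
    """Prefix a payload with the 9-octet HTTP/2 frame header."""
    length = len(payload)
    if length >= 1 << 24:
        raise ValueError("payload does not fit the 24-bit length field")
    header = struct.pack(
        ">BHBBI",
        length >> 16,            # upper 8 bits of the length
        length & 0xFFFF,         # lower 16 bits of the length
        frame_type,
        flags,
        stream_id & 0x7FFFFFFF,  # the reserved bit is sent as 0
    )
    return header + payload


def unpack_frame(data: bytes) -> Tuple[int, int, int, bytes]:
    """Return (type, flags, stream_id, payload) for the first frame in data."""
    hi, lo, frame_type, flags, raw_id = struct.unpack(">BHBBI", data[:FRAME_HEADER_LEN])
    length = (hi << 16) | lo
    stream_id = raw_id & 0x7FFFFFFF  # ignore the reserved bit on receipt
    payload = data[FRAME_HEADER_LEN:FRAME_HEADER_LEN + length]
    return frame_type, flags, stream_id, payload


if __name__ == "__main__":
    frame = pack_frame(TYPE_DATA, FLAG_END_STREAM, 1, b"speech")
    print(frame.hex(), unpack_frame(frame))
```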


All HTTP/2 communication runs on a single TCP connection, which can carry one or more
bi-directional streams. Each stream has a unique identifier and a priority number, which are used to tag
the bidirectional HTTP messages so that the stream to which a frame belongs can be identified. An HTTP
message may be an HTTP request or an HTTP response. The protocol arranges the frames of different streams
and interleaves them when sending the messages. At the other end, the protocol reassembles them by using
the stream identifier, which is carried by each frame in its header.
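The interleaving and reassembly described above can be sketched as follows. This is a simplified model under stated assumptions: frames are plain `(stream_id, chunk)` tuples rather than real HTTP/2 frames, the scheduler is a simple round-robin (real HTTP/2 scheduling also takes priorities and flow control into account), and the function names and payloads are illustrative.

```python
from collections import defaultdict
from itertools import zip_longest
from typing import Dict, List, Tuple

Frame = Tuple[int, bytes]  # (stream identifier, payload chunk)


def split_into_frames(stream_id: int, message: bytes, frame_size: int) -> List[Frame]:
    """Cut one HTTP message into frames tagged with its stream identifier."""
    return [
        (stream_id, message[i:i + frame_size])
        for i in range(0, len(message), frame_size)
    ]


def interleave(*frame_lists: List[Frame]) -> List[Frame]:
    """Send frames of several streams in turn over one shared connection."""
    wire: List[Frame] = []
    for group in zip_longest(*frame_lists):
        wire.extend(frame for frame in group if frame is not None)
    return wire


def reassemble(wire: List[Frame]) -> Dict[int, bytes]:
    """Rebuild each message at the receiver by grouping frames per stream identifier."""
    messages: Dict[int, bytes] = defaultdict(bytes)
    for stream_id, chunk in wire:
        messages[stream_id] += chunk
    return dict(messages)


if __name__ == "__main__":
    request = split_into_frames(1, b"POST speech data", 6)
    events = split_into_frames(3, b"data: hi", 6)
    wire = interleave(request, events)
    print(wire)
    print(reassemble(wire))
```

The stream identifiers 1 and 3 are used because client-initiated HTTP/2 streams carry odd identifiers.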

With the framing model, HTTP/2 multiplexes HTTP requests and responses by splitting
HTTP messages into frames, interleaving them, and reassembling them at the receiver, as shown in Figure 2.
HTTP/2 needs only one connection per origin, which reduces the latency and improves the throughput.
However, there is a negative consequence when TCP suffers head-of-line blocking or a decrease of the
congestion window. Fortunately, this drawback can be partially compensated by the advantages of HTTP/2
mechanisms such as header compression and stream prioritization [21].
The possible negative effects of this technique are described in RFC 6202 (2011) [13].
Figure 2. HTTP/2 request and response multiplexing within a shared connection [22]


2.2. Server-Sent Events (SSE) 

Server-Sent Events (SSE) is a technology whereby a browser receives automatic updates from a server
via an HTTP connection. SSE provides the EventSource API, which is standardized as part of HTML5 by the
World Wide Web Consortium (W3C) [20]. The SSE specification defines an API that allows servers to push
data to Web pages over HTTP in the form of Document Object Model (DOM) events. The data is encoded as
"text/event-stream" content and pushed by using an HTTP streaming mechanism. However, the specification
suggests disabling HTTP chunking for serving event streams unless the rate of messages is high enough for
this not to matter.
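To make the "text/event-stream" encoding concrete, the following sketch formats and parses events using the `id`, `event`, and `data` fields of the SSE wire format, with a blank line terminating each event. The helper names and the event name `hypothesis` are assumptions for illustration; the parser is deliberately minimal and ignores the `retry` field and other details handled by a browser's EventSource implementation.

```python
from typing import Dict, List, Optional


def format_sse(data: str, event: Optional[str] = None, event_id: Optional[str] = None) -> str:
    """Encode one event as text/event-stream content."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of the payload is sent in its own "data:" field.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"  # a blank line dispatches the event


def parse_sse(stream: str) -> List[Dict[str, str]]:
    """Decode a stream of events; the event type defaults to "message"."""
    events = []
    for block in stream.split("\n\n"):
        if not block.strip():
            continue
        fields: Dict[str, str] = {"event": "message"}
        data_lines = []
        for line in block.split("\n"):
            if line.startswith(":"):
                continue  # comment line, often used as a keep-alive
            name, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space after the colon is dropped
            if name == "data":
                data_lines.append(value)
            elif name in ("event", "id"):
                fields[name] = value
        fields["data"] = "\n".join(data_lines)
        events.append(fields)
    return events


if __name__ == "__main__":
    message = format_sse('{"final": false}', event="hypothesis", event_id="1")
    print(repr(message))
    print(parse_sse(message))
```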


2.3. Application Framework with HTTP/2 plus SSE 

The proposed real-time application framework using HTTP/2 plus SSE is developed as a modification
of a full-duplex application framework using WebSocket, as shown in Figure 3. The framework consists of
two main components: a client and a server. The client provides the speech data and presents the recognition
results to the user, whereas the server decodes the received speech data, generates hypotheses, and
normalizes them. The server is divided into a master-server and one or more workers. The master-server runs
on the Tornado framework [23]. It manages the set of connected workers and their status, forwards the
received speech data to a worker, and immediately pushes the recognition results to the client. The worker
performs the speech recognition using the GStreamer Online Decoder plugin from the Kaldi toolkit [24].
Figure 3. Full-duplex application framework from Alumae
Figure 4. Real-time application framework for speech recognition using HTTP/2 plus SSE


Figure 4 illustrates the real-time application framework using HTTP/2 plus SSE. The communication
between the client and the master-server uses the HTTP/2 protocol in cooperation with Server-Sent Events
(SSE) on a single TCP connection. HTTP/2 multiplexes two streams that originate from two HTTP/2
requests. The first request asks the server for the speech recognition service and carries the speech data
to be recognized. The second request opens the SSE service, through which the server pushes the intermediary
recognition results, called hypotheses. The user may send the speech data in any container and encoding
supported by GStreamer, e.g., Ogg, MP4, and Speex. The recognition results are sent to the client in JSON
format.

For interaction with the client, the master-server provides three handlers, i.e., 'MainHandler',
'HTTP2ChunkRecognizedHandler', and 'EventSource'. MainHandler handles the first request of the client in
order to deliver the web page and other resources, i.e., stylesheet files and JavaScript files.
HTTP2ChunkRecognizedHandler manages the HTTP request for the recognition service, which is executed by a
worker as a back-end process on the server. The speech data is forwarded by the master-server to the worker.
To perform its work, the worker has several modules, namely the Decoder PipeLine, the GStreamer Online
Decoder (plugin), and Kaldi (ASR). For multiple recognition services, the workers are placed in a pool. The
master-server takes one of the available workers when a client requests a connection. As in Alumae's
application, each worker is connected to the master-server by full-duplex communication, so the workers can
run either on the local host or on a remote host. This communication is handled by the master-server handler
called 'WorkerSocketHandler'. Furthermore, the master-server uses the EventSource handler for the SSE
service. Once the client has sent the SSE request, the EventSource handler obtains the events to be published
by monitoring the updates in 'DataSource'. DataSource stores the hypotheses from the worker.
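The interplay between DataSource and the EventSource handler can be sketched with the standard library alone. This is not the paper's Tornado implementation: the class below only borrows the name DataSource, the JSON field names `transcript` and `final` are assumptions, and the output list stands in for the HTTP/2 response stream to the browser.

```python
import asyncio
import json
from typing import List


class DataSource:
    """Stores hypotheses from a worker and wakes up handlers waiting for updates."""

    def __init__(self) -> None:
        self._items: List[dict] = []
        self._changed = asyncio.Condition()

    async def add(self, transcript: str, final: bool) -> None:
        async with self._changed:
            self._items.append({"transcript": transcript, "final": final})
            self._changed.notify_all()

    async def wait_from(self, index: int) -> List[dict]:
        """Block until items beyond `index` exist, then return them."""
        async with self._changed:
            await self._changed.wait_for(lambda: len(self._items) > index)
            return self._items[index:]


async def event_source(source: DataSource, out: List[str]) -> None:
    """Plays the role of the EventSource handler: publish each update as an SSE message."""
    sent = 0
    while True:
        for item in await source.wait_from(sent):
            out.append(f"data: {json.dumps(item)}\n\n")
            sent += 1
            if item["final"]:
                return  # stop after the final hypothesis


async def demo() -> List[str]:
    source, out = DataSource(), []
    handler = asyncio.create_task(event_source(source, out))
    # The worker side: two hypotheses for the same sentence, the second one final.
    for text, final in [("for his.", False), ("for his cabinet.", True)]:
        await source.add(text, final)
    await handler
    return out


if __name__ == "__main__":
    for message in asyncio.run(demo()):
        print(message, end="")
```

Keeping the hypotheses in a shared store decouples the request that uploads the speech data from the request that streams the results, which is what allows the two HTTP/2 streams to proceed independently.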


2.4. Experiment 

There are three steps in implementing the proposed framework in Alumae's application:
providing HTTP/2 server support in the master-server; providing three handlers in the master-server
(MainHandler, HTTP2ChunkRecognizedHandler, and EventSource); and providing the client programs for
browsers that support HTTP/2 plus SSE.

Figure 5 shows an example of the Web page that we use for the experiments. This Web page provides the
features we need to compare the performance of the full-duplex application and the HTTP/2 plus SSE
application: audio playback; buttons for sending the speech data to the recognition service using either
HTTP/2 plus SSE or WebSocket; a part of the page for speech data information, connection status, and
measurements of latency, connection speed, and real-time factor (RTF); and a part of the page for presenting
the hypotheses and the final result of the recognition.

To evaluate whether the proposed framework is more efficient than the full-duplex framework, we
developed an experimental environment using an ns-3 based emulation platform [25], as shown in
Figure 6. The environment employs the network simulation that was also used for validating the analytical
model explained in our previous work [8]. To simulate a real network, a tap device and an ns-3 tap bridge are
employed in that environment. The real client node runs on a virtual host (Ubuntu 16.04 LTS) in a virtual
system that is connected by a Linux bridge to the tap device prepared on the Linux host. The tap device of the
client is connected to the simulated network in the TapBridge UseBridge mode. Meanwhile, the real server
node runs on the Linux host and uses a tap device connected to the simulated network in the TapBridge
UseLocal mode. Detailed information about the ns-3 tap-bridge modes can be found in the ns-3
documentation [26].
File: Track001.wav, File size: 15951 kB, File duration: 90.43 seconds
Raw data to be sent: 15951378
Total time: 64.03 seconds, Connection speed: 249 kBps, RTF: 0.71

Final Recognition Result:
a touchdown and one students board level three by michael mccarty gender card and helen send this cell study on new program
contains report is that correspond to the level three students book published by cambridge university press this week. for his
cabinet. the u.. self study listening unit wanted. listen to the conversation on page six alexis and jacob are talking about jacobs
roommate. in zero eight thousand you really working out when you don't see that much have been really mean he's always
working you know at the library years sitting at the computer without at least he's not always really wild parties are playing music
gone night. the air and he's pretty easy going i'm always borrowing his stuff and he doesn't mind he sounds better than my old
roommate she was so unpleasant your ride she was pretty bad yeah she was always talking about people behind their backs you
mean like we're doing right now.

Hypotheses List:
... d and helen send this cell study on new program contains report is that correspond to the level three students book published
by cambridge university press.
a touchdown and one students board level three by michael mccarty gender card and helen send this cell study on new
program contains report is that correspond to the level three students book published by cambridge university press
this week.
for his.
for his company like.
for his cabinet.

Figure 5. Application website of TCP-based speech recognition
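The measurements displayed in Figure 5 are consistent with simple ratios of the raw values. The paper does not state the formulas used by the Web page, so the definitions below are assumptions: RTF is taken as the total time divided by the audio duration, and the connection speed as the number of bytes sent divided by the total time, with 1 kB = 1000 bytes.

```python
def real_time_factor(total_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: time spent on the recognition per second of audio."""
    return total_seconds / audio_seconds


def connection_speed_kbps(raw_bytes: int, total_seconds: float) -> float:
    """Assumed definition: bytes sent per second of total time, in kB/s (1 kB = 1000 B)."""
    return raw_bytes / total_seconds / 1000


if __name__ == "__main__":
    raw_bytes = 15_951_378   # "Raw data to be sent" in Figure 5
    audio_seconds = 90.43    # file duration in Figure 5
    total_seconds = 64.03    # total time in Figure 5
    print(f"RTF: {real_time_factor(total_seconds, audio_seconds):.2f}")
    print(f"Connection speed: {connection_speed_kbps(raw_bytes, total_seconds):.0f} kBps")
```

An RTF below 1 means that the recognition finished faster than the duration of the audio itself.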
Figure 6. Experimental environment on simulated ns-3 network


Furthermore, the experiments are performed in four simulated network settings to obtain various
propagation delays and loss rates, as shown in Table 1. For each setting, we used ten speech data samples of
English conversations (WAV files). These samples are not linearly related to one another, so no linear
relation among the resulting application latencies should be expected. Nevertheless, this experiment focuses
on the comparison between the two frameworks (HTTP/2 plus SSE and WebSocket).
Table 1. Four parameter settings of the simulated ns-3 network

       # of Sources     Bottleneck Link Parameter       App
Set    TCP     HTTP     P.Delay    B.w.      Buffer     Delay    Loss Rate
                        (ms)       (Mbps)    (pkts)     (ms)
1      9       40       40         [?]       50         120      0.4% - 0.5%
2      5       30       40         3.7       50         120      0.2% - 0.3%
3      9       40       5          5         100        50       0.4% - 0.5%
4      5       30       5          5         100        50       0.2% - 0.3%


3. RESULTS AND DISCUSSION 

Two things make the proposed framework different from the full-duplex framework: it uses HTTP/2
in cooperation with SSE, and its client is implemented in the browser as HTML and JavaScript.
All communication is handled using the HTTP/2 protocol, and the speech data is sent by the browser using
XMLHttpRequest and the POST method. Since HTTP/2 is a new protocol and modern browsers such as
Chrome and Firefox may not allow it to run without security, an application of this framework is usually run
over a secured connection (TLS/SSL, HTTPS).

Meanwhile, the full-duplex framework using WebSocket has the following characteristics: the client is
implemented as a Python program running in a terminal; WebSocket does not depend on HTTP and uses HTTP
only for the handshake, after which it continues on the same TCP connection; and WebSocket communication
can also be secured by using TLS/SSL (WSS).

The results for each experiment setting are depicted in Figure 7. Overall, the measurement results
show that the application latencies of HTTP/2 plus SSE and WebSocket are comparable.

The differences arise from the different settings of the propagation delay, the loss rate, and the
average round-trip time, as shown in Figure 8. The loss rate and the round-trip time are the main parameters
of the TCP-delay function.

Given this comparable result, the following points explain why HTTP/2 plus SSE does not achieve
lower latency than WebSocket even though HTTP/2 has several features intended to reduce latency:

a. The application multiplexed only two streams, triggered by two HTTP/2 requests, i.e., the request to the
speech recognition handler and the request to the SSE handler, as shown in Figure 9. The more streams are
multiplexed, the more efficient the system becomes.

b. The compression of HTTP header fields by HPACK (Header Compression for HTTP/2) is less useful
because the experimental application does not use large metadata and does not have cookies. This
feature will have a significant effect when the application multiplexes more requests, because it then
compresses more headers from those requests and avoids repeated header fields.
Figure 7. Latency of application using HTTP/2 plus SSE versus WebSocket
Figure 8. Latency comparison between HTTP/2 plus SSE and WebSocket in each experiment setting


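The argument about header compression can be illustrated with a toy model of header indexing. It is a deliberately simplified stand-in for HPACK, not an implementation of it: there is no static table, no Huffman coding, and no table eviction, and the octet costs as well as the header values are assumptions. The point is only that the saving comes from repeated header fields, so it grows with the number of requests.

```python
from typing import Dict, List, Tuple

Header = Tuple[str, str]


class ToyHeaderIndexer:
    """Remembers header fields already sent on the connection (toy dynamic table)."""

    def __init__(self) -> None:
        self._table: Dict[Header, int] = {}

    def encoded_size(self, headers: List[Header]) -> int:
        """Return the assumed number of octets needed to send this header list."""
        size = 0
        for header in headers:
            if header in self._table:
                size += 1  # assumption: an indexed field costs one octet
            else:
                name, value = header
                # assumption: a literal costs its text plus two length octets
                size += len(name) + len(value) + 2
                self._table[header] = len(self._table) + 1
        return size


if __name__ == "__main__":
    # Hypothetical header fields repeated on every request of one connection.
    headers = [("content-type", "audio/x-wav"), ("user-agent", "demo-client/1.0")]
    indexer = ToyHeaderIndexer()
    sizes = [indexer.encoded_size(headers) for _ in range(5)]
    print(sizes)
```

With only two requests, as in the experimental application, the headers are sent as literals almost as often as they are sent as indices, so the total saving stays small.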

Last, we examine whether the HTTP/2 plus SSE framework performs better than WebSocket when the
application runs through a proxy server. The Squid 3.5.12 proxy server [27] is used.
Name              Status   Protocol   Type          Initiator       Size      Time
events            200      h2         eventsource   Other           46.8 KB   58.37 s
recognize?U...    200      h2         xhr           demo torna...   1.1 KB    57.33 s

2 requests | 47.9 KB transferred

Figure 9. Multiplexing of two streams (HTTP/2 request-response and Server-Sent Events)


Figure 10 shows the WiFi-LAN based environment, which consists of a client host with IP address
192.168.100.102, a proxy host with IP address 192.168.100.104, and a server. The proxy server runs on port
3128, while the server listens on port 8888. The client is represented by the Google Chrome browser, which is
set to send the data through the proxy server at the address 192.168.100.104:3128. All experiment rules are
defined in the configuration file 'squid.conf'. The experiments are then conducted in two scenarios, i.e.,
running the application over a secured connection (HTTPS or WSS) and over an ordinary connection (HTTP or
WS).


Indonesian J Elec Eng & Comp Sci, Vol. 12, No. 3, December 2018 : 1230 — 1238 


Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752








Internet Browser: 192.168.100.102; Proxy Server: 192.168.100.104:3128

Figure 10. Experiment environment on network with proxy-server
Two conclusions can be drawn from this particular experiment. When both the application using HTTP/2
plus SSE and the application using WebSocket run over an SSL/TLS connection (HTTPS/WSS), the proxy
server lets their traffic pass through a TCP tunnel, and no proxying or caching is needed. Meanwhile, when
both applications run over a connection other than SSL/TLS, the WebSocket connection is blocked, but the
communication using HTTP/1.1 plus SSE stays connected. In the latter scenario HTTP/2 is not applicable.
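The TCP tunnel mentioned for the secured scenario is normally requested with the HTTP CONNECT method: the client asks the proxy to open a connection to the target and, after a 2xx response, the proxy relays bytes without interpreting them. The sketch below only builds the request and checks a status line. The host name `speech.example` is hypothetical, the port 8888 is the server port from the experiment, and the helper names are assumptions.

```python
def build_connect_request(host: str, port: int) -> bytes:
    """Build the request a client sends to a proxy to open a tunnel to host:port."""
    target = f"{host}:{port}"
    lines = [f"CONNECT {target} HTTP/1.1", f"Host: {target}", "", ""]
    return "\r\n".join(lines).encode("ascii")


def tunnel_established(status_line: str) -> bool:
    """Any 2xx status in reply to CONNECT means the proxy switched to tunnel mode."""
    parts = status_line.split(" ", 2)
    return len(parts) >= 2 and parts[0].startswith("HTTP/") and parts[1].startswith("2")


if __name__ == "__main__":
    print(build_connect_request("speech.example", 8888))
    print(tunnel_established("HTTP/1.1 200 Connection established"))
```

Inside such a tunnel the TLS session is end to end between the browser and the server, which is consistent with the observation that no proxying or caching takes place in the secured scenario.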


4. CONCLUSION 

In this paper, we propose a real-time application framework using HTTP/2 plus Server-Sent Events
(SSE). Experiments were conducted to compare the latency of the proposed framework against the latency
of a full-duplex application using WebSocket. The results show that the latencies of both frameworks
are comparable. However, given the advantages of the HTTP/2 protocol and the fact that the proxy
server blocks WebSocket communication, especially in idle situations, our proposed framework offers a better
alternative than WebSocket for real-time web-based speech recognition.